This paper presents a method, called AOGTracker, for simultaneously tracking, learning and parsing (TLP) of unknown objects in video sequences with a hierarchical and compositional And-Or graph (AOG) representation. The AOG captures both structural and appearance variations of a target object in a principled way. The TLP method is formulated in the Bayesian framework with spatial and temporal dynamic programming (DP) algorithms inferring object bounding boxes on-the-fly. During online learning, the AOG is discriminatively learned using latent SVM to account for appearance (e.g., lighting and partial occlusion) and structural (e.g., different poses and viewpoints) variations of a tracked object, as well as distractors (e.g., similar objects) in the background. Three key issues in online inference and learning are addressed: (i) maintaining purity of positive and negative examples collected online, (ii) controlling model complexity in latent structure learning, and (iii) identifying critical moments to re-learn the structure of the AOG based on its intrackability. Intrackability measures the uncertainty of an AOG based on its score maps in a frame. In experiments, our AOGTracker is tested on two popular tracking benchmarks with the same parameter setting: the TB-100/50/CVPR2013 benchmarks, and the VOT benchmarks --- VOT 2013, 2014, 2015 and TIR2015 (thermal imagery tracking). On the former, our AOGTracker outperforms state-of-the-art tracking algorithms, including two trackers based on deep convolutional networks. On the latter, our AOGTracker outperforms all other trackers on VOT2013 and is comparable to state-of-the-art methods on VOT2014, 2015 and TIR2015.